Memory Optimization in Codelet Execution Model on Many-core Architectures

نویسندگان

  • Yao Wu
  • Guang R. Gao
چکیده

The upcoming exa-scale era requires a parallel program execution model capable of achieving scalability, productivity, energy efficiency, and resiliency. The codelet model is a fine-grained dataflow-inspired execution model which is the focus of several tera-scale and exa-scale studies such as DARPA’s UHPC, DOE’s X-Stack, and the European TERAFLUX projects. Current codelet implementations aim to making fully use of computation resources by balancing their workload in the multi-core and many-core systems. The performance is improved by this method. However, by making use of the features of the codelet model the memory optimization can be also implemented to improve the performance as well as energy efficiency. In this thesis, we focus on the memory optimization on memory workload balance and locality exploitation in the codelet model. As a case study, various versions of FFT algorithms are implemented on IBM Cyclops-64 – a many-core system to demonstrate that the fine-grain codelet execution model is able to execute the codelets that involve different workload on the memory bandwidth in an appropriate order to reduce memory contention and thus improve performance. The experiment result shows that our fine-grain guided algorithm achieves up to 46% performance improvement comparing to a coarse-grain implementation on Cyclops-64. To automatically exploit locality in codelet execution, we provide three optimal or nearly optimal scheduling algorithms based on static information of codelet graph and locality. They have different trade-offs in algorithmic complexity, locality exploitation, program execution time, and energy efficiency. We test and analyze the three algorithms on various applications on an emulation platform of Cyclops-64. The experiment result shows that our algorithms reduce up to 59.7% of global memory access by using local memory to buffer intermediate data between two adjacent codelets

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Unified Runtime System for Heterogeneous Multi-core Architectures

Approaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs that can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devot...

متن کامل

Ultra-Low-Energy DSP Processor Design for Many-Core Parallel Applications

Background and Objectives: Digital signal processors are widely used in energy constrained applications in which battery lifetime is a critical concern. Accordingly, designing ultra-low-energy processors is a major concern. In this work and in the first step, we propose a sub-threshold DSP processor. Methods: As our baseline architecture, we use a modified version of an existing ultra-low-power...

متن کامل

An Implementation of the Codelet Model

Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we propose a novel program execution model, the Codelet model, which is designed for new systems tasked with efficiently managing varying resources. T...

متن کامل

Energy efficient tiling on a Many-Core Architecture

Energy efficiency and power consumption have become an imperative requirement in Computer Architecture. The rising multi-core and many-core era has been motivated by the increasing demand of high performance computations restricted to a feasible power requirement. How to model the energy consumption of many-core architectures in order to propose techniques for the design of energy efficient app...

متن کامل

Towards An Energy-Efficient Scheduler in the Codelet Model

This paper uses the IBM Cyclops-64 manycore architecture as a case study to show how to exploit locality and save energy in the fine-grain dataflow-inspired codelet execution model. State-of-the-art codelet scheduling focuses on dynamic workload balance of codelets (similar to tasks). While this approach may achieve reasonable performance since computation resources are fully utilized, it may n...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014